Week 9 Exercise: Time Series Data and Analysis Strategies

Author

Chris Zuidema, Elena Austin, and Instructors for ENVH 556 Autumn 2024

This document was rendered on November 25, 2024.

Introduction and Purpose

The purpose of this lab is to practice working with time series data: developing facility importing and merging data collected at varying time scales, applying smoothing approaches, and using analysis tools to reduce the complexity of these data. In addition, as with all the labs in this course, you should focus on writing a coherent lab report that addresses the scientific context and scientific questions.

In this lab exercise, we will use data collected as part of the MOV-UP mobile monitoring campaign. These data were collected at 1-second and 10-second time scales. The lab will walk you through the steps of importing the raw instrument data. The MOV-UP study collected data from the following instruments:

  • GPS tracker
  • particle counter (P-Trak, model 8525, TSI Inc., MN)
  • condensation particle counter (CPC, model 3007, TSI Inc., MN)
  • black carbon aethalometer (microAeth AE51)
  • particle sizing spectrometer (NanoScan)
  • carbon dioxide (CO2) (SenseAir K-30-FS)

The full suite of instruments on the monitoring platform and their characteristics can be found in Riley et al., 2014, Larson et al., 2017, and Xiang et al., 2020. The remainder of the code in this week’s lab uses the GPS, P-Trak, and CO2 measurements.

Dates and Times

Managing dates and times in exposure assessment is a common challenge because instruments log data in a variety of date and time formats. You’ve no doubt encountered the following representations of dates:

  • 25 Nov, 2024
  • 11/25/2024
  • 2024-11-25

Because direct-reading instruments are made all over the world, you are likely to encounter devices that use many different date and time formats. You’ll find it advantageous to convert everything to a standardized format, including the time zone; doing so consistently makes it less likely that you’ll interpret a date or time incorrectly.
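As a quick illustration, the three representations above can all be parsed into R’s standard Date class. The strings and format codes below are illustrative, and `%b` month-name parsing depends on your locale:

```r
# parse three common date formats into the standard Date class
# (illustrative strings; "%b" month-name parsing is locale-dependent)
d1 <- as.Date("25 Nov, 2024", format = "%d %b, %Y")
d2 <- as.Date("11/25/2024",   format = "%m/%d/%Y")
d3 <- as.Date("2024-11-25")   # ISO 8601 parses without a format string

# all three now print identically, in ISO 8601 format
c(d1, d2, d3)
```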

Date objects in R are of class Date, and compound date and time (“datetime”) objects are of class POSIXct. Let’s take a look at these object types and show their classes:

#-----datetime.and.date objects-----

# show POSIXct object -- a datetime object with both date and time
Sys.time()
[1] "2024-11-25 10:42:56 PST"
# show class
# Note that 'POSIXt' is a virtual class that cannot be used directly; both
# POSIXct and POSIXlt inherit from it. POSIXt exists to allow operations,
# such as subtraction, to mix the two classes.
class(Sys.time())
[1] "POSIXct" "POSIXt" 
# show Date object-- date without time included
Sys.Date()
[1] "2024-11-25"
# show class
class(Sys.Date())
[1] "Date"

Internally, date and datetime objects store the number of seconds (for POSIXct) or days (for Date) since midnight on January 1st, 1970. For the current time and date, these values are about 1.7325602 × 10^9 seconds and 2.0052 × 10^4 days. When printed to the console, they are displayed in a human-readable format that follows the International Organization for Standardization (ISO) 8601 date and time standard. ISO 8601 provides an unambiguous, well-known, and well-defined method of representing dates and times.
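You can see this internal representation by stripping the class. This is a quick sketch; the POSIXct value below depends on when you run it:

```r
# the underlying numeric values of datetime and date objects
unclass(Sys.time())               # seconds since 1970-01-01 00:00:00 UTC
as.numeric(as.Date("2024-11-25")) # days since 1970-01-01
```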

It is often necessary to define date/time components or the format of a datetime object specifically for each instrument. There are several reasons in practice why we need to pay careful attention to this:

  • The time scale of each instrument can be different.
  • The internal instrument clocks can be different from each other and/or the current time.
  • The instruments may be using different time zones or daylight saving time conventions, or there may be issues with simultaneous measurements on either side of the international date line.

Regardless, it is necessary to ensure a common schedule prior to merging the data. Accounting for these issues during the initial data wrangling will make it easier to handle date and datetime variables throughout your analysis process.
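The lubridate package distinguishes two common fixes; a hypothetical example (not MOV-UP data) follows. `with_tz()` re-expresses the same instant in another zone, while `force_tz()` keeps the clock reading but relabels the zone, which is what you need when an instrument’s clock was set to the wrong time zone:

```r
library(lubridate)

# a hypothetical instrument timestamp logged in UTC
t_utc <- ymd_hms("2018-07-19 19:54:08", tz = "UTC")

# with_tz(): same instant, displayed in a different zone
with_tz(t_utc, "US/Pacific")   # 2018-07-19 12:54:08 PDT

# force_tz(): same clock reading, reinterpreted as Pacific time
force_tz(t_utc, "US/Pacific")  # 2018-07-19 19:54:08 PDT (a different instant)
```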

Identify Files of Interest, Read Files & Wrangle

Here, we’ll focus on CO2 data from Car 1 (there were 2 cars); for your assignment, we ask you to analyze data from P-Traks, either using the same car or the other car.

In each of the following chunks, we show the tibble. In the .html file, you can scroll through the first 10,000 rows. However, don’t assume that the dataset contains only 10,000 observations; to see the full dimensions, use glimpse or str.

#-----gather file names for import-----

# get filename paths for data files
filenames <- list.files(data_dir) %>% str_subset("car1")

GPS

#-----gps-----

# Get GPS file name  
gps_file <- str_subset(filenames, "GPS")

# read GPS data
gps <- read_csv(file.path(data_dir, gps_file)) %>%
   
   # set column names
   set_names(c("date", "time", "latitude", "longitude", "altitude", "speed")) %>% 
   
   # create datetime column (notice we must specify the timezone)
   mutate(datetime = as.POSIXct(paste(date, time), tz = "US/Pacific"))

# show dataframe
gps
# A tibble: 15,028 × 7
   date       time     latitude longitude altitude speed datetime           
   <date>     <time>      <dbl>     <dbl>    <dbl> <dbl> <dttm>             
 1 2018-07-19 12:54:08     47.7     -122.     29.5  0.2  2018-07-19 12:54:08
 2 2018-07-19 12:54:09     47.7     -122.     29.9  1.11 2018-07-19 12:54:09
 3 2018-07-19 12:54:10     47.7     -122.     28.2  0.48 2018-07-19 12:54:10
 4 2018-07-19 12:54:11     47.7     -122.     26.4  0.14 2018-07-19 12:54:11
 5 2018-07-19 12:54:12     47.7     -122.     26.3  0    2018-07-19 12:54:12
 6 2018-07-19 12:54:13     47.7     -122.     24.1  0.09 2018-07-19 12:54:13
 7 2018-07-19 12:54:14     47.7     -122.     24    0    2018-07-19 12:54:14
 8 2018-07-19 12:54:15     47.7     -122.     24    0    2018-07-19 12:54:15
 9 2018-07-19 12:54:16     47.7     -122.     24    0    2018-07-19 12:54:16
10 2018-07-19 12:54:17     47.7     -122.     23.9  0    2018-07-19 12:54:17
# ℹ 15,018 more rows
# and time zone
tz(gps$datetime)
[1] "US/Pacific"

P-Trak

These instruments count ultrafine particles (UFPs) at 1-second time resolution. To capture the change in ultrafine particles, two ultrafine particle counter (P-Trak) instruments were used. One instrument was operated under standard conditions and reported the total particle number concentration (in units of particles/cm3, #/cm3, pt/cm3, or 1/cm3) for all particles with diameters between 20 nm and 1 µm. The second P-Trak operated using a screened inlet: the mesh size and the number of mesh layers were selected to intercept the smallest particles as they entered the system, raising the lower particle size cutoff to 36 nm. The difference between the unscreened and screened P-Trak instruments provided an estimate of the concentration of 20-36 nm particles. This was verified using a Scanning Mobility Particle Sizer spectrometer.

#-----p-trak-----

# get p-trak not-screened file names
ptrak_noscreen_file <- filenames %>% 
   str_subset("PT") %>% 
   str_subset("scrnd|Scrnd|Screened|screen", negate = TRUE)

# read data
ptrak_noscreen <-  read_tsv(file.path(data_dir, ptrak_noscreen_file), skip = 29) %>% 
   
   # name columns
   set_names(c("date","time", "ptrak_noscreen_conc")) %>% 
   
   # modify date column and create datetime column (notice we must specify the
   # date format)
   mutate(date = as.Date(date, format = "%m/%d/%Y"), 
          datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") )


# get p-trak-screened file names and read in data
ptrak_screen_file <- filenames %>% 
   str_subset("scrnd|Scrnd|Screened|screen")

# read data
ptrak_screen <-  read_tsv(file.path(data_dir, ptrak_screen_file), skip = 29) %>% 
   
   # name columns
   set_names(c("date","time", "ptrak_screen_conc")) %>% 
   
   # modify date column and create datetime column
   mutate(date = as.Date(date, format = "%m/%d/%Y"), 
          datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") )


# show dataframes - have PNC measurements every second
ptrak_noscreen
# A tibble: 16,240 × 4
   date       time     ptrak_noscreen_conc datetime           
   <date>     <time>                 <dbl> <dttm>             
 1 2018-07-19 12:35:01                4190 2018-07-19 12:35:01
 2 2018-07-19 12:35:02                4050 2018-07-19 12:35:02
 3 2018-07-19 12:35:03                4110 2018-07-19 12:35:03
 4 2018-07-19 12:35:04                4120 2018-07-19 12:35:04
 5 2018-07-19 12:35:05                4100 2018-07-19 12:35:05
 6 2018-07-19 12:35:06                4010 2018-07-19 12:35:06
 7 2018-07-19 12:35:07                4110 2018-07-19 12:35:07
 8 2018-07-19 12:35:08                4070 2018-07-19 12:35:08
 9 2018-07-19 12:35:09                4060 2018-07-19 12:35:09
10 2018-07-19 12:35:10                4050 2018-07-19 12:35:10
# ℹ 16,230 more rows
ptrak_screen
# A tibble: 16,243 × 4
   date       time     ptrak_screen_conc datetime           
   <date>     <time>               <dbl> <dttm>             
 1 2018-07-19 12:35:01              1890 2018-07-19 12:35:01
 2 2018-07-19 12:35:02              1920 2018-07-19 12:35:02
 3 2018-07-19 12:35:03              1850 2018-07-19 12:35:03
 4 2018-07-19 12:35:04              1840 2018-07-19 12:35:04
 5 2018-07-19 12:35:05              1810 2018-07-19 12:35:05
 6 2018-07-19 12:35:06              1820 2018-07-19 12:35:06
 7 2018-07-19 12:35:07              1870 2018-07-19 12:35:07
 8 2018-07-19 12:35:08              1860 2018-07-19 12:35:08
 9 2018-07-19 12:35:09              1850 2018-07-19 12:35:09
10 2018-07-19 12:35:10              1840 2018-07-19 12:35:10
# ℹ 16,233 more rows

Carbon Dioxide

#-----co2-----

# get CO2 file name
co2_file <- str_subset(filenames, "CO2")

# read data (tab separated values), specify time as character because of time parsing issue
co2 <- read_tsv(file.path(data_dir, co2_file), skip = 1, 
                col_types = cols(`System_Time_(h:m:s)` = col_character()) 
                ) %>% 
   
   # name columns
   setNames(c("date", "time", "co2_conc", "h2o_conc", "temp", "pressure", 
              "co2_absorp", "flow") ) %>% 
   
   # select columns of interest
   select(date, time, co2_conc) %>% 
   
   # address time parsing (some minute values have only one digit) and 
   # create datetime column
   mutate(time = as_hms(time), 
          datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") )

# show dataframe - have CO2 measurements every second
head(co2)
# A tibble: 6 × 4
  date       time     co2_conc datetime           
  <date>     <time>      <dbl> <dttm>             
1 2018-07-19 12:53:18     407. 2018-07-19 12:53:18
2 2018-07-19 12:53:19     407. 2018-07-19 12:53:19
3 2018-07-19 12:53:20     406. 2018-07-19 12:53:20
4 2018-07-19 12:53:21     406. 2018-07-19 12:53:21
5 2018-07-19 12:53:22     406. 2018-07-19 12:53:22
6 2018-07-19 12:53:23     406. 2018-07-19 12:53:23
str(co2$datetime)
 POSIXct[1:15156], format: "2018-07-19 12:53:18" "2018-07-19 12:53:19" "2018-07-19 12:53:20" ...
tz(co2$datetime)
[1] "US/Pacific"

Merge & Inspect Instrument Data

The purrr package from the tidyverse provides functional programming tools that let you apply a function iteratively over a list. We use it next to merge (or “join”) the datasets. Note that the full_join function searches for common column names and joins on these. Alternatively, you can specify the join columns explicitly with the argument by = c("date", "time", "datetime").

We print out the dataset to get a basic idea of whether the merge worked. Ask yourself questions like the following:

  • Are there missing data and what is its pattern?
  • Do the variables (columns) in the final dataset make sense to you?

We observe that early rows of the dataset have missing GPS and CO2 data, so we know this is something we need to investigate further.

#-----merge datasets-----

# create a list of all data to merge 
instrument_list <- list(gps = gps, ptrak_noscreen = ptrak_noscreen, 
                        ptrak_screen = ptrak_screen, co2 = co2)

# use `purrr::reduce` to run `full_join()` on listed items 
all_data <- reduce(instrument_list, full_join) %>% 
   
   # reorder columns so datetime is first
   relocate(datetime) %>% 
   
   # arrange data in datetime order
   arrange(datetime)
Joining with `by = join_by(date, time, datetime)`
Joining with `by = join_by(date, time, datetime)`
Joining with `by = join_by(date, time, datetime)`
# use glimpse to learn more about the data
glimpse(all_data)
Rows: 16,255
Columns: 10
$ datetime            <dttm> 2018-07-19 12:35:01, 2018-07-19 12:35:02, 2018-07…
$ date                <date> 2018-07-19, 2018-07-19, 2018-07-19, 2018-07-19, 2…
$ time                <time> 12:35:01, 12:35:02, 12:35:03, 12:35:04, 12:35:05,…
$ latitude            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ longitude           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ altitude            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ speed               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ptrak_noscreen_conc <dbl> 4190, 4050, 4110, 4120, 4100, 4010, 4110, 4070, 40…
$ ptrak_screen_conc   <dbl> 1890, 1920, 1850, 1840, 1810, 1820, 1870, 1860, 18…
$ co2_conc            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Plot Data

Time Series

Let’s make a quick visualization of the concentration data to get an idea of the data we’re working with.

What features do you see?

#-----plot data-----

# note that these data are all for one day
plot_date <- first(all_data$date)

# transform data from wide to long for ggplot
all_data %>% pivot_longer(cols = contains("_conc"), 
                          names_to = "pollutant", 
                          values_to = "concentration") %>% 
# pipe into ggplot
ggplot(aes(x = datetime, y = concentration)) + 
  
  # plot each pollutant in a facet
  facet_wrap(~pollutant, ncol = 1, scales = "free_y") + 
  
  # specify type of plot
  geom_line() +
  
  # specify theme
  theme_bw() + 
  labs(title = plot_date)

Data Availability

Data availability plots can be a useful initial check to visualize what data we have across instruments over time. We show one variable from each dataset. In this case the plot is not very informative: the data are fairly complete, and the plot doesn’t make it easy to see the limited amount of missing data, even though we already observed missing values after merging.

#-----data availability plot-----

# transform data from wide to long for ggplot (using latitude for GPS)
all_data %>% 
  mutate(GPS = latitude) %>%
  pivot_longer(cols = c(contains("conc"), GPS), 
               names_to = "measurement") %>%

# make plot
ggplot(aes(x = datetime, y = measurement, color = factor(measurement)) ) +
  geom_point(size = 0.5, alpha=0.4) + 
  labs(x = "Time", y = "Instrument") + 
  theme_bw() + 
  theme(legend.position = "none")

Missing Data

Next we determine precisely how much missing data we have and what the likely sources of missingness are. First we calculate the total and percent missing for every variable in the dataset. We observe that the amount of missing data depends on the instrument.

#-----missing-----

lapply(all_data, function(i){ 
   
   tibble( 
          # sum missing
          n_miss = sum(is.na(i)), 
          
          # percent missing
          perc_miss = round(n_miss/length(i) * 100, 1)
          )
   }) %>% 
   
   # bind list
   bind_rows(.id = "variable")
# A tibble: 10 × 3
   variable            n_miss perc_miss
   <chr>                <int>     <dbl>
 1 datetime                 0       0  
 2 date                     0       0  
 3 time                     0       0  
 4 latitude              1227       7.5
 5 longitude             1227       7.5
 6 altitude              1227       7.5
 7 speed                 1227       7.5
 8 ptrak_noscreen_conc     13       0.1
 9 ptrak_screen_conc       10       0.1
10 co2_conc              1097       6.7

Time Interval

Our data appeared to be on a similar 1-second time interval. Let’s confirm by calculating the difference in time between consecutive measurements in each of our datasets:

#-----time stamps-----

# we'll use the instrument data list
lapply(instrument_list, function(i){
   
   # get vector of datetimes
   datetimes <- i[["datetime"]]
   
   # get a vector of the differences between consecutive measurements
   time_diff <- datetimes - lag(datetimes)
   
   # summarize differences
   table(time_diff)
   
}) %>% 
   
   # use list names
   set_names(names(instrument_list))
$gps
time_diff
    0     1     2    12    17 
    2 15006    17     1     1 

$ptrak_noscreen
time_diff
    1 
16239 

$ptrak_screen
time_diff
    1 
16242 

$co2
time_diff
    1 
15155 
# Most measurements occur every second.
# Note the differences in the GPS: a couple of measurements appear to be
# duplicates, and a handful of measurements have larger gaps.

It seems that some of our missingness is due to instrument reporting. Some measurements did not occur at 1-second intervals. From this we can also see there are duplicate measurements for some instruments (difference of zero seconds).

Lastly, it appears that the GPS instrument was not turned on at the same time as the other instruments (there is no GPS data at the beginning of the all_data dataframe). This may be because the other instruments were turned on for a “warm-up” period before the measurement campaign started.

Let’s drop rows missing the GPS variables for convenience. But is this the best thing to do under all circumstances and for all variables?

#-----drop missing gps-----

all_data <- drop_na(all_data, latitude, longitude)

Timeseries Data

Time series data are marked by measurements that are indexed to a time component. There are many R representations of time series data: ts, xts, data.frame, data.table, tibble, zoo, tsibble, tibbletime, and timeSeries. The tsbox package has many useful functions for converting between these time series formats.
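As one concrete example of moving between these representations (here using tsibble’s own converter rather than tsbox), a classic base-R ts object converts directly:

```r
library(tsibble)

# convert a base-R `ts` object (monthly airline passenger counts) to a tsibble
ap <- as_tsibble(AirPassengers)
head(ap)
```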

We’ve focused most of our effort on tidyverse tools this term, so let’s concentrate on the tsibble, feasts, and slider package functions.

First we want to turn our dataset into a tsibble. We use the function try so that knitting our document doesn’t fail if this operation fails. We observe that it does fail because there were duplicates, so then we inspect them.

#-----try tsibble-----

# # use "try" here so document knits
# try( as_tsibble(all_data, index = datetime) )

# inspect duplicate rows
duplicates(all_data, index = datetime)
# A tibble: 4 × 10
  datetime            date       time     latitude longitude altitude speed
  <dttm>              <date>     <time>      <dbl>     <dbl>    <dbl> <dbl>
1 2018-07-19 12:59:59 2018-07-19 12:59:59     47.7     -122.      7.4     0
2 2018-07-19 12:59:59 2018-07-19 12:59:59     47.7     -122.      7.4     0
3 2018-07-19 13:00:00 2018-07-19 13:00:00     47.7     -122.      7.4     0
4 2018-07-19 13:00:00 2018-07-19 13:00:00     47.7     -122.      7.4     0
# ℹ 3 more variables: ptrak_noscreen_conc <dbl>, ptrak_screen_conc <dbl>,
#   co2_conc <dbl>

tsibble alerts us to our time issues and prompts us to deal with them, so let’s remove the duplicates (you could also average them).

#-----to tsibble-----

# remove duplicate rows, and convert to `tsibble`
ts_data <- all_data %>% 
  distinct(datetime, .keep_all = TRUE) %>% 
  as_tsibble(index = datetime)

head(ts_data)
# A tsibble: 6 x 10 [1s] <US/Pacific>
  datetime            date       time     latitude longitude altitude speed
  <dttm>              <date>     <time>      <dbl>     <dbl>    <dbl> <dbl>
1 2018-07-19 12:54:08 2018-07-19 12:54:08     47.7     -122.     29.5  0.2 
2 2018-07-19 12:54:09 2018-07-19 12:54:09     47.7     -122.     29.9  1.11
3 2018-07-19 12:54:10 2018-07-19 12:54:10     47.7     -122.     28.2  0.48
4 2018-07-19 12:54:11 2018-07-19 12:54:11     47.7     -122.     26.4  0.14
5 2018-07-19 12:54:12 2018-07-19 12:54:12     47.7     -122.     26.3  0   
6 2018-07-19 12:54:13 2018-07-19 12:54:13     47.7     -122.     24.1  0.09
# ℹ 3 more variables: ptrak_noscreen_conc <dbl>, ptrak_screen_conc <dbl>,
#   co2_conc <dbl>

Let’s also fill the gaps in time. In other words, we have decided in this analysis that the index column datetime needs to be unique and complete; in standard time series analyses the time increment between every pair of consecutive observations is the same. Adding these new time rows introduces “explicit” NAs to the data, so you’ll then have a choice about whether and how to fill in the non-time data. tidyr::fill() can help with this task, and the choice of how to fill may be more or less important depending on the time scale you’re working on, the amount of missingness, and whether consecutive measurements are missing. Given the small amount of missingness and the one-second time scale of the data, it’s not unreasonable to introduce a few rows with NAs here.
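A toy sketch of tidyr::fill() with made-up values, carrying the last observation forward into explicit NAs:

```r
library(tibble)
library(tidyr)

# made-up 5-row series with two explicit NAs
df <- tibble(t = 1:5, conc = c(400, NA, NA, 406, 408))

# carry the last observed value downward into the NAs
fill(df, conc, .direction = "down")
```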

#-----fill ts-----

# count rows before filling
paste("number of rows before filling gaps:", nrow(ts_data))
[1] "number of rows before filling gaps: 15026"
# save for calculation below 
temp <- ts_data

# Turn implicit missing values into explicit missing values (add NAs to data)
ts_data <- fill_gaps(ts_data) 
  
# # if you wanted to "fill" the new explicitly missing data with a value drawn
# # from the dataset, here is an option. (add a `%>%` to the function above). 
# # Note this could have downstream effects. 
# fill(co2_conc, .direction = "down")

# count rows after filling
paste("number of rows after filling gaps:", nrow(ts_data))
[1] "number of rows after filling gaps: 15070"
# see what rows were added - these contain NA values other than dates
differences_df <- anti_join(ts_data, temp)
Joining with `by = join_by(datetime, date, time, latitude, longitude, altitude,
speed, ptrak_noscreen_conc, ptrak_screen_conc, co2_conc)`
head(differences_df)
# A tsibble: 6 x 10 [1s] <US/Pacific>
  datetime            date   time   latitude longitude altitude speed
  <dttm>              <date> <time>    <dbl>     <dbl>    <dbl> <dbl>
1 2018-07-19 13:00:01 NA        NA        NA        NA       NA    NA
2 2018-07-19 13:07:59 NA        NA        NA        NA       NA    NA
3 2018-07-19 13:09:21 NA        NA        NA        NA       NA    NA
4 2018-07-19 13:09:22 NA        NA        NA        NA       NA    NA
5 2018-07-19 13:09:23 NA        NA        NA        NA       NA    NA
6 2018-07-19 13:09:24 NA        NA        NA        NA       NA    NA
# ℹ 3 more variables: ptrak_noscreen_conc <dbl>, ptrak_screen_conc <dbl>,
#   co2_conc <dbl>
paste("number of rows added:", nrow(differences_df))
[1] "number of rows added: 44"

Temporal Autocorrelation

Time series data are often correlated in time: measurements taken close together in time are more similar than measurements taken farther apart. An autocorrelation plot helps identify at what lag (or time interval) data become less correlated. In the following plot, the blue lines indicate bounds on the autocorrelation of these data, i.e., which autocorrelations are statistically different from zero. At what lag do the CO2 measurements appear less correlated?

The Autocorrelation Function (ACF) measures the overall correlation between a time series and its lagged values, including both direct and indirect effects from intermediate lags. In contrast, the Partial Autocorrelation Function (PACF) isolates the direct relationship between a time series and a specific lag, removing the influence of intervening lags. Use the ACF to understand the general structure of dependencies in a time series, and the PACF to determine the number of significant lags for autoregressive models.
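To make the idea concrete, the lag-1 autocorrelation is essentially the correlation of a series with a copy of itself shifted by one step. A toy sketch with simulated AR(1) data (not the CO2 series):

```r
# simulate an AR(1) series with coefficient 0.8
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.8), n = 1000))

# lag-1 autocorrelation computed directly...
cor(x[-1], x[-length(x)])

# ...and via base R's acf() (slightly different divisor, so nearly equal)
acf(x, lag.max = 1, plot = FALSE)$acf[2]
```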

#-----autocorrelation-----

# autocorrelation plot - shows the overall correlation structure across lags
## calculates correlation between a conc and its lagged versions (both immediate/direct and indirect)

## ACF measures the linear (Pearson correlation coef, R) correlation between a time series and its lagged versions in a tsibble.  

## plot: A high positive correlation suggests that the values are similar
## between the time points separated by the lag. Values near zero indicate no
## correlation at that lag.
## The blue dashed lines show the approximate 95% bounds expected if no
## autocorrelation exists in the data.
ts_data %>% ACF(co2_conc,  # column used for computation
                lag_max = 60 # maximum lag at which to calculate the acf
                ) %>% autoplot()

# partial autocorrelation plot
## shows only the correlation between a conc (e.g. at time t) and a lag conc  (e.g. at time t-2), adjusting for the influence of intermediate conc lags(e.g. at time t-1).

## The first lag (lag 1) has a high and statistically significant partial autocorrelation (above the blue dashed line), indicating that the immediate prior value of the time series has a strong direct influence on the current value.
## After lag 1 (and possibly lag 2 or 4), the partial autocorrelations become less significant (fall within the blue dashed lines). Beyond the first lag, there is thus minimal direct influence of earlier values on the current value once the effects of lag 1 (and other significant lags) are accounted for.

ts_data %>% PACF(co2_conc, lag_max = 60) %>% autoplot()

Temporal Smoothing with New Time Scales – Moving Averages

A common task with time series data is averaging to different time scales. From the autocorrelation plot above, we can see that at a lag of 60 seconds the CO2 concentrations are no longer correlated. Let’s convert our 1-second data to some longer time scales. One package that helps with averaging tasks is slider.

For the first average we’ll use the time scale period (minutes) to calculate 1-minute averages; with other data, the period could also be months, weeks, years, etc. This is a “block average,” where the period we are averaging over is based on an attribute of the data. For example, if you wanted monthly averages, this method would be “aware” of the fact that each month has a different number of days. For the second average (a 2-minute average), we’ll demonstrate a moving (“sliding” or “rolling”) average, where a number of neighbors before and after each value of interest are used to compute an average.

Notice how both functions preserve the input dataframe size. For the period average we can then easily merge the new column into our existing dataframe (however, you’ll notice values are repeated 60 times for each minute). The moving average works nicely with mutate().
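Before the full pipeline, here is slide_dbl() on a toy vector, showing how a centered window of one neighbor on each side behaves (and shrinks at the edges):

```r
library(slider)

# centered moving average with one neighbor before and one after each value;
# the window shrinks at the start and end of the vector
slide_dbl(c(1, 2, 3, 4, 5), mean, .before = 1, .after = 1)
#> [1] 1.5 2.0 3.0 4.0 4.5
```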

slider

#-----slider-----

ts_data <- ts_data %>% 
  
  # left join new 1-minute period values
  left_join(

    # Approach 1: aggregate data to 1 min averages using fixed period intervals
    ## returns a dataframe with repeated 1 min avg measures every second
    slide_period_dfr(ts_data, ts_data$datetime, "minute",
                     ~tibble(datetime = floor_date(.x$datetime),
                             co2_conc_one = mean(.x$co2_conc)
                             ) ),
    by = "datetime" ) %>% 
  
  # add a datetime column with 1-minute precision to help if you choose to
  # `group_by()` and `summarise()` 1-minute averages
  mutate(datetime_one = ymd_hm(format(datetime, "%Y-%m-%d %H:%M"), 
                               tz = "US/Pacific")) %>%
  
  # Approach 2: calculate moving averages, e.g., 2 minutes, to provide a
  # smoothed version of the time series (this works because we've made sure
  # our datetime index is unique and complete)
  mutate(co2_conc_two = slide_dbl(co2_conc, ~mean(.x, na.rm = TRUE), 
                                  .before = 60, .after = 60) ) 

# # slide_period also works in the same way, but it returns a vector of different
# # length than the original inputs
# with(ts_data, 
#      slide_period_dbl(.x = co2_conc, .i = datetime, 
#              .period = "minute", .f = mean, na.rm = TRUE) 
#      )

# glimpse
glimpse(ts_data)
Rows: 15,070
Columns: 13
$ datetime            <dttm> 2018-07-19 12:54:08, 2018-07-19 12:54:09, 2018-07…
$ date                <date> 2018-07-19, 2018-07-19, 2018-07-19, 2018-07-19, 2…
$ time                <time> 12:54:08, 12:54:09, 12:54:10, 12:54:11, 12:54:12,…
$ latitude            <dbl> 47.65177, 47.65179, 47.65176, 47.65175, 47.65175, …
$ longitude           <dbl> -122.3047, -122.3047, -122.3047, -122.3047, -122.3…
$ altitude            <dbl> 29.5, 29.9, 28.2, 26.4, 26.3, 24.1, 24.0, 24.0, 24…
$ speed               <dbl> 0.20, 1.11, 0.48, 0.14, 0.00, 0.09, 0.00, 0.00, 0.…
$ ptrak_noscreen_conc <dbl> 4840, 4830, 4790, 4910, 4890, 4830, 4910, 4870, 48…
$ ptrak_screen_conc   <dbl> 1830, 1820, 1790, 1860, 1810, 1860, 1890, 1910, 18…
$ co2_conc            <dbl> 410.614, 409.812, 409.423, 409.397, 409.078, 409.0…
$ co2_conc_one        <dbl> 408.105, 408.105, 408.105, 408.105, 408.105, 408.1…
$ datetime_one        <dttm> 2018-07-19 12:54:00, 2018-07-19 12:54:00, 2018-07…
$ co2_conc_two        <dbl> 407.9190, 407.9005, 407.8832, 407.8664, 407.8519, …
# compare a few values; convert to a data.frame since tibbles round/limit the displayed digits
head(select(ts_data, contains(c("datetime", "co2"))) %>% as.data.frame())
             datetime        datetime_one co2_conc co2_conc_one co2_conc_two
1 2018-07-19 12:54:08 2018-07-19 12:54:00  410.614      408.105     407.9190
2 2018-07-19 12:54:09 2018-07-19 12:54:00  409.812      408.105     407.9005
3 2018-07-19 12:54:10 2018-07-19 12:54:00  409.423      408.105     407.8832
4 2018-07-19 12:54:11 2018-07-19 12:54:00  409.397      408.105     407.8664
5 2018-07-19 12:54:12 2018-07-19 12:54:00  409.078      408.105     407.8519
6 2018-07-19 12:54:13 2018-07-19 12:54:00  409.083      408.105     407.8316

tsibble

Next, let’s take a look at a tsibble approach for 5-minute averages. Notice that the size of the dataframe decreases.

#-----new timescales-----

ts_new <- ts_data %>% 
  
  # get the "floor" of each datetime row (unfortunately `tsibble` doesn't let us
  # use "datetime" for this new variable name)
  index_by(datetime_new = floor_date(datetime, unit = "5mins")) %>%
  
  # summarise the mean of rows across all dataframe columns
  # summarise the mean across all numeric columns
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), 
            .groups = "drop") %>% 
  
  # rename to get "datetime" variable name back
  rename(datetime = datetime_new) %>% 
  
  # create variable so new 5 min averages are aligned (i.e., centered) with
  # originals for plots (because of `floor_date()` behavior)
  mutate(plot_time = datetime + minutes(2) + seconds(30))
# glimpse
glimpse(ts_new)
Rows: 52
Columns: 11
$ datetime            <dttm> 2018-07-19 12:50:00, 2018-07-19 12:55:00, 2018-07…
$ latitude            <dbl> 47.65173, 47.65988, 47.64919, 47.62773, 47.57721, …
$ longitude           <dbl> -122.3048, -122.3017, -122.3056, -122.3266, -122.3…
$ altitude            <dbl> 16.29423, 2.14000, 16.03077, 53.02279, 25.33767, 1…
$ speed               <dbl> 2.489231e+00, 1.899750e+01, 3.735408e+01, 4.930044…
$ ptrak_noscreen_conc <dbl> 5074.231, 4587.667, 4815.050, 13524.228, 13208.933…
$ ptrak_screen_conc   <dbl> 1875.962, 2117.467, 3505.385, 12188.162, 9296.767,…
$ co2_conc            <dbl> 408.1050, 419.2890, 471.7210, 495.6258, 492.0209, …
$ co2_conc_one        <dbl> 408.1050, 419.2890, 479.8304, 485.7061, 492.0209, …
$ co2_conc_two        <dbl> 408.0122, 419.8052, 470.8984, 498.4439, 491.6913, …
$ plot_time           <dttm> 2018-07-19 12:52:30, 2018-07-19 12:57:30, 2018-07…

Following up on our autocorrelation plot above, notice how it changes. Each lag is now 5 minutes long, and the autocorrelation is only evident in the first 2-3 lags. This spans a longer duration in time than we saw with the 1-second data, but fewer lags.

#---- 5-min autocorrelation ----

# check 5-minute average autocorrelation
ts_new %>% ACF(co2_conc) %>% autoplot()

All our example data are on the 1-second time scale; however, it is common to have multiple time scales in a suite of direct-reading instruments. Using the techniques above, you’ll be able to average different time intervals to a common schedule. Once the measurements are on a shared time scale, they can be merged/joined as we have shown in the “Merge & Inspect Instrument Data” section.
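As a concrete sketch of that workflow, the toy example below (simulated data, not the MOV-UP files; the object and column names are illustrative) averages a 1-second and a 10-second stream to a shared 1-minute schedule with `floor_date()` and then joins them:

```r
library(dplyr)
library(lubridate)

t0 <- as.POSIXct("2018-07-19 13:00:00", tz = "US/Pacific")

# simulated 1-second and 10-second instrument streams over 10 minutes
fast <- tibble(datetime = t0 + seconds(0:599), conc_a = rnorm(600, 400, 5))
slow <- tibble(datetime = t0 + seconds(seq(0, 590, 10)), conc_b = rnorm(60, 10, 1))

# average each stream to 1-minute blocks with `floor_date()`
fast_1min <- fast %>%
  group_by(datetime = floor_date(datetime, "minute")) %>%
  summarise(conc_a = mean(conc_a, na.rm = TRUE))

slow_1min <- slow %>%
  group_by(datetime = floor_date(datetime, "minute")) %>%
  summarise(conc_b = mean(conc_b, na.rm = TRUE))

# join on the shared 1-minute index
merged <- full_join(fast_1min, slow_1min, by = "datetime")
nrow(merged)  # 10 one-minute rows
```

The same pattern extends to any pair of time scales, as long as both streams are floored to the same unit before joining.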

Plot Different Time Scales

Let’s inspect the time series of our different averages:

#-----plot different time averages-----

# we're going to use a simpler approach here and just manually specify each
# series rather than make a dataframe with all the plotting data combined

ggplot() + 
  
  # plot original 1-second data
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc, color = "blue")) + 
  
  # plot `slider` 1-min period averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_one, color = "black")) +
  
  # plot `slider` 2-min moving averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_two, color = "red")) +
  
  # plot `tsibble` 5-minute block averages (note `x = plot_time`)
  geom_line(data = ts_new, aes(x = plot_time, y = co2_conc, color = "darkgreen")) +  
  
  # specify legend values manually
  scale_color_identity(name = "Time Average", 
                       breaks = c("blue", "black", "red", "darkgreen"), 
                       labels = c("1-sec", "1-min", "2-min", "5-min"), 
                       guide = "legend") +
  
  # labels
  labs(x = "Time", 
       y = "CO2 (ppm)" 
       ) +
  
  # theme
  theme_bw() + 
  theme(legend.position = "bottom")

Map Concentrations

Our data was from a mobile monitoring campaign - so let’s map the concentrations.

In the following chunk we select the data corresponding to the geographic area of interest, define the bounding box (extent of the mapping area), and download a background (tile) map.

#-----get map objects-----
# WGS84 latitude-longitude. same as: "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
latlong_proj <- 4326  

# projection in meters we need for distance calculations
lambert_proj <- "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs" 


# define the bounding box for the map
bbox <- ts_data %>% 
  
  # first filter to time period (and therefore geographic area) of interest
  filter(between(datetime, 
                 as.POSIXct("2018-07-19 13:13:00", tz = "US/Pacific"),
                 as.POSIXct("2018-07-19 16:32:00", tz = "US/Pacific")) ) %>% 
  drop_na(latitude, longitude) %>%
  st_as_sf(coords = c('longitude', 'latitude'), crs = latlong_proj) %>%
  # add a buffer (in meters)
  st_transform(crs = lambert_proj) %>%
  st_buffer(1e4) %>%
  st_transform(crs = latlong_proj) %>%
  # then pipe into `st_bbox()`
  st_bbox()

# note you could write code to save this tile plot for offline use. This can have technical issues, however, so we won't do that here. 

tiles <- maptiles::get_tiles(x = bbox, provider = "OpenStreetMap", zoom = 12)

basemap <- ggplot() +
    # Add basemap tiles as background
    ggspatial::layer_spatial(tiles) + 
    theme_minimal()

basemap

To help represent the time-varying nature of the data in this lab, let’s collect the times of some of our measurements to help our visualization.

#-----get time stamps-----

map_times <- ts_data %>% 
  
  # convert to `tibble` because we're breaking up our `tsibble`
  as_tibble() %>%
  
  # filter to rows with :00, :15, :30, and :45 minute times
  filter(str_detect(time, pattern = ":00:00|:15:00|:30:00|:45:00")) %>% 
  
  # make a map_time variable, dropping the seconds 
  # (`$` indicates the character pattern should be at the end of the string)
  mutate(map_time = str_remove(time, pattern = ":00$"))

Map Carbon Dioxide Concentrations

#-----map concentrations-----

# map concentrations
map_co2 <- basemap +
  
  # map locations with concentrations
  geom_point(data = drop_na(ts_data, co2_conc), 
             aes(x = longitude, y = latitude, color = co2_conc)) + 
  
  # add measurement times to map
  geom_text_repel(data = map_times,
                  aes(x = longitude, y = latitude, label = map_time), 
                  force_pull = -0.02 ) +
  
  # color scale
  scale_color_continuous(type = "viridis") +  
  
  # labels
  labs(color = "CO2 (ppm)") +
  
  # Bounding box limits for latitude and longitude
  coord_sf(xlim = c(bbox$xmin, bbox$xmax),
           ylim = c(bbox$ymin, bbox$ymax)) +
  
  # theme
  theme_void() 

# show map
map_co2

Background Standardization

We show different approaches to estimating the background in this section and plot them for comparison.

Rolling Percentile & Rolling Minimum

Brantley et al., 2014 summarize a number of methods to calculate background concentrations. One approach described by Bukowiecki et al., 2002, includes taking the 5th percentile of 1 or 5 min averages. As we talked about in class, another approach is to calculate the moving minimum of 30-minute time intervals. slider simplifies these types of moving calculations.

#-----calculate background with moving functions-----

# specify the time in minutes for moving calculations 
# (for percentile `_pct` and minimum `_min`)
time_pct <- 5
time_min <- 30

# specify the percentile considered to be background
pct <- 0.05

# use `slider` to add moving background calculations to `ts_data`. Then
# calculate an adjusted concentration by removing the background concentration.
# For moving percentile, assign new names with suffix `_pct`, and for moving
# minimums use `_min`. `min()` will return Inf when all values are NA; use
# `na_if()` to avoid.
ts_data <- ts_data %>% 
  mutate( 
         # using moving percentiles
         co2_background_pct = slide_dbl(co2_conc, 
                                    ~quantile(.x, probs = pct, na.rm = TRUE), 
                                    .before = (time_pct*60)/2, 
                                    .after = (time_pct*60)/2), 
         co2_adj_pct = co2_conc - co2_background_pct, 
         
         # using moving minimums
         co2_background_min = slide_dbl(co2_conc, 
                                 ~min(.x, na.rm = TRUE), 
                                 .before = (time_min*60)/2, 
                                 .after = (time_min*60)/2), 
         co2_background_min = na_if(co2_background_min, Inf), 
         co2_adj_min = co2_conc - co2_background_min
         )

For the percentile approach we’ve taken 5 minutes as the time interval and 5th percentile of the data to calculate the background CO2 concentration. Let’s compare our two background estimation approaches in one plot:

#-----plot background-----

# plot the two background estimates on the same plot
ggplot(data = ts_data) + 
  geom_line(aes(x = datetime, y = co2_background_pct, color = "black")) + 
  geom_line(aes(x = datetime, y = co2_background_min, color = "red")) + 
  labs(x = "Time", y = "CO2 Background (ppm)") +
  
  # specify legend values manually
  scale_color_identity(name = "Background Type",
                       breaks = c("black", "red"),
                       labels = c("Percentile", "Minimum"),
                       guide = "legend") +
  
  # theme
  theme_bw() + 
  theme(legend.position = "bottom")

Loess Smoother of Background

We can further smooth our estimates of background. These background estimates, even when smoothed, appear to be picking up more than background. However, in a later plot we consider the background estimates in the context of the data, which are much higher, so the apparent high variation in background isn’t that large in the context of the complete CO2 dataset.

#-----loess smooth of moving background-----

# specify loess models
mdl_pct <- loess(co2_background_pct ~ seq_along(time), data = ts_data, 
                 span = 0.1, degree = 1)

mdl_min <- loess(co2_background_min ~ seq_along(time), data = ts_data, 
                 span = 0.1, degree = 1)

# add smoothed predictions of background to `ts_data`
ts_data <- ts_data %>% 
  mutate(co2_background_pctl = predict(mdl_pct, ts_data),
         co2_background_minl = predict(mdl_min, ts_data)) 

# prepare dataframe for ggplot 
# (creating new columns for our different types of estimates)
ts_data %>% pivot_longer(contains("_background_")) %>% 
  mutate(min_pct = if_else(str_detect(name, "min"), "minimum", "percentile"), 
         smooth_un = if_else(str_detect(name, "l"), "smoothed", "unsmoothed") ) %>% 

# compare all background estimates
ggplot(aes(x = datetime, y = value, 
           linetype = interaction(smooth_un, min_pct, sep = " "), 
           color = interaction(smooth_un, min_pct, sep = " ") ) ) + 
  
  # plot lines
  geom_line() + 
  
  # set colors
  scale_color_manual(name = "Background Type", 
                     values = c("red", "red", "black", "black")) +
  
  # set linetype
  scale_linetype_manual(name = "Background Type", 
                        values = c("dashed", "solid", "dashed", "solid")) + 
  
  # labels
  labs(x = "Time", y = "CO2 Background (ppm)") +
  
  # theme
  theme_bw() + 
  theme(legend.position = "bottom")

Plot Carbon Dioxide with Background

We’ll demonstrate with the rolling minimum approach, showing both the unsmoothed and loess smoothed versions.

#-----plot co2 and background-----

ggplot(data = ts_data) + 
  
  # plot the co2 observations
  geom_point(aes(x = datetime, y = co2_conc), size = 0.8, color = "darkgray") +
  
  # plot co2 background using the loess smoothed and unsmoothed rolling minimums
  geom_line(aes(x = datetime, y = co2_background_minl), lty = "dashed", color = "red") + 
  geom_line(aes(x = datetime, y = co2_background_min), lty = "solid", color = "red") + 
  
  # labels
  labs(x = "Time", y = "CO2 with Background (ppm)") +
  
  # theme
  theme_bw()

Identifying Peak Concentrations

As we’ve seen, mobile monitoring data has a mix of temporal and spatial variability. In an effort to disentangle temporal confounding from spatial analysis, we can calculate the background concentration of a pollutant. Peak identification can then be performed on the background adjusted data to identify single or multi-pollutant extreme features (for example, departures above a particular quantile of single pollutant features vs departure above background of a multivariate representation such as a PCA feature).
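To make the multivariate idea concrete, here is a hedged sketch (with simulated, illustrative variables, not our dataset) of building one PCA feature from several background-adjusted pollutants and flagging its departures above a chosen quantile:

```r
set.seed(556)

# simulated background-adjusted pollutant series (names are illustrative)
adj <- data.frame(co2_adj = rnorm(1000),
                  pnc_adj = rnorm(1000),
                  bc_adj  = rnorm(1000))

# scale the pollutants so no single instrument dominates, then take the first
# principal component as a combined multi-pollutant score
pc1 <- prcomp(scale(adj))$x[, 1]

# flag the top 5% of the combined score as candidate multi-pollutant peaks
is_peak <- pc1 >= quantile(pc1, 0.95, na.rm = TRUE)
sum(is_peak)  # 50 flagged observations
```

In practice you would inspect the PCA loadings to confirm the first component reflects a plausible source signature before interpreting the flagged peaks.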

With background removed from the CO2 concentrations, we can now use the adjusted data to identify concentration peaks. Concentration peaks indicate potential sources of the pollutant. In this case, a CO2 peak could be a large vehicle like a truck.

#-----id peaks-----

# specify the time in minutes for the rolling window
time_window <- 1

# specify winsorize cut-point (percentile)
w_pct <- 0.95

# specify cutoff for peak change in slope values
pqv <- 0.6

# we can use `slider` to calculate these rolling statistics
# here we adjust using the rolling-minimum background (`co2_background_min`)
ts_data <- ts_data %>% 
  mutate(
         # calculate adjusted CO2 concentration
         co2_adj = (co2_conc - co2_background_min),
         
         # identify the "next" value in the series with `lead()`
         co2_lag = lead(co2_adj), 
         
         # calculate squared difference in consecutive CO2 concentrations
         co2_diff_sq = (co2_adj - co2_lag)^2,
         
         # limit extreme values by winsorizing
         co2_diff_sq = replace(co2_diff_sq, co2_diff_sq > 
                                 quantile(co2_diff_sq, w_pct, na.rm = TRUE), 
                               quantile(co2_diff_sq, w_pct, na.rm = TRUE)),
         
         # calculate a 1 minute rolling mean of CO2 differences
         roll_co2_diff = slide_dbl(co2_diff_sq, 
                                   ~mean(.x, na.rm = TRUE), 
                                         .before = (time_window*60)/2, 
                                         .after = (time_window*60)/2),
         # calculate peaks
         co2_is_peak = if_else(roll_co2_diff >= 
                                 quantile(roll_co2_diff, pqv, na.rm = TRUE), 
                               TRUE, FALSE)
         ) 
#-----plot co2 peaks-----

ggplot(ts_data) + 
  
  # plot co2 observations
  geom_point(aes(x = datetime, y = co2_conc, color = co2_is_peak), size = 0.8) +
  
  # plot background
  geom_line(aes(x = datetime, y = co2_background_pct)) + 
  
  # set colors
  scale_colour_manual(values = c("darkgray", "darkred")) +
  
  # labels
  labs(x = "Time", y = "CO2 Peaks with Background (ppm)") +
  
  # theme
  theme_bw() + 
  theme(legend.position = "none")

Local Emission Plume Detection with P-Trak

As the CO2 map shows, the data we’ve collected represent concentrations over 4 hours, meaning some of the variability we’re observing could be a mix of spatial (i.e. the variability we see in the map) and temporal (i.e. the variability observed in time series plots) processes.

Next, we’ll explore the idea of pollutant emission ratios described by Kolb et al., 2004, which normalizes pollutant concentrations to CO2 concentrations, potentially helping identify emission sources in data with both temporal and spatial variability. Here we use the P-Trak data to plot a related ratio: the proportion of the smallest particles (20-36 nm) relative to the total ultrafine particle concentration.
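As a minimal illustration of the Kolb et al. normalization (with made-up numbers, not our measurements), an emission ratio divides the background-adjusted pollutant signal by the background-adjusted CO2 signal, so plume strength is expressed per unit of combustion:

```r
# illustrative background-adjusted values for four plume encounters
pnc_adj <- c(1200, 5400, 300, 9800)  # particle counts above background (#/cm^3)
co2_adj <- c(4, 18, 2, 35)           # CO2 above background (ppm)

# emission ratio: particles emitted per ppm of excess CO2
emission_ratio <- pnc_adj / co2_adj
round(emission_ratio)  # 300 300 150 280
```

Two plumes with very different absolute particle counts can share an emission ratio (the first two encounters here), suggesting a similar source type sampled at different dilutions.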

#-----map emission.ratios-----

# calculate P-Trak ratio
ts_data <- ts_data %>% 
  mutate(ptrak_ratio = (ptrak_noscreen_conc - ptrak_screen_conc) / ptrak_noscreen_conc)

# map ratios that are greater than zero 
map_ratios <- basemap +
  
  # locations with concentrations
  geom_point(data = ts_data %>% drop_na(ptrak_ratio), 
             aes(x = longitude, y = latitude, color = ptrak_ratio)) + 
  
  # color scale
  scale_color_continuous(type = "viridis", limits = c(0, 1)) + 
  
  # labels
  labs(color = "P-trak Ratio \nscreened:unscreened") +
  
  # theme
  theme_void() 

# show map
map_ratios

Practice Session

Use the data from car1 or car2 to repeat a subset of the analyses demonstrated above for ultrafine particles and write up a coherent report about the patterns you see. The goal of this lab is for you to use the code we have provided to read in time-varying data from multiple instruments, check that they were read in appropriately, and address a few scientific questions about time-varying exposures. Here are the topics for you to address in your lab report:

  1. Data description, data quality, and approach to analysis:
    1. How many observations were available, and how many were usable after preliminary data management (e.g. dropping NAs after merging)?
    2. What variables are available and which ones are of scientific interest?
    3. When were the data collected and in what area?
    4. What steps did you follow to accomplish the scientific analyses you discuss in the rest of the report?
  2. Scientific analyses:
    1. Plot a time series plot for the screened and unscreened P-Trak data on one plot. Do you see a different pattern for the two measurements of ultrafine particles?
    2. Look at the autocorrelation in the unscreened P-Trak data. Compare autocorrelation plots at the 1-second time scale with autocorrelation plots at the 5-minute time scale. Describe the patterns and comment on their similarities and differences.
    3. Choose one additional analysis of interest to you that will allow you to highlight a feature of the time-varying data that is of scientific interest to you. Justify your reasoning for your choice, leveraging your understanding from the lecture and/or Brantley paper. For instance, you might consider comparing different background estimation approaches, showing a time series with background overlaid or removed, comparing the difference between the screened and unscreened P-Traks over space and/or time, or, identifying peak concentrations. Present an appropriate choice of plots and/or maps to support your analysis. Interpret these and connect your interpretation to the material discussed in the lecture.

Code Appendix

Session Information

#-----session info: beginning of Code Appendix-----

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggrepel_0.9.6    sf_1.0-17        maptiles_0.8.0   ggspatial_1.1.9 
 [5] slider_0.3.1     feasts_0.4.1     fabletools_0.5.0 tsibble_1.1.5   
 [9] purrr_1.0.2      lubridate_1.9.3  hms_1.1.3        ggplot2_3.5.1   
[13] stringr_1.5.1    readr_2.1.5      tidyr_1.3.1      dplyr_1.1.4     
[17] knitr_1.48       pacman_0.5.1    

loaded via a namespace (and not attached):
 [1] gtable_0.3.5         anytime_0.3.9        xfun_0.48           
 [4] htmlwidgets_1.6.4    tzdb_0.4.0           vctrs_0.6.5         
 [7] tools_4.3.1          generics_0.1.3       curl_5.2.3          
[10] parallel_4.3.1       tibble_3.2.1         proxy_0.4-27        
[13] fansi_1.0.6          pkgconfig_2.0.3      KernSmooth_2.23-24  
[16] distributional_0.5.0 lifecycle_1.0.4      farver_2.1.2        
[19] compiler_4.3.1       munsell_0.5.1        terra_1.7-78        
[22] codetools_0.2-20     htmltools_0.5.8.1    class_7.3-22        
[25] yaml_2.3.10          crayon_1.5.3         pillar_1.9.0        
[28] ellipsis_0.3.2       slippymath_0.3.1     classInt_0.4-10     
[31] tidyselect_1.2.1     digest_0.6.37        stringi_1.8.4       
[34] labeling_0.4.3       fastmap_1.2.0        grid_4.3.1          
[37] colorspace_2.1-1     cli_3.6.3            magrittr_2.0.3      
[40] utf8_1.2.4           e1071_1.7-16         withr_3.0.1         
[43] scales_1.3.0         warp_0.2.1           bit64_4.5.2         
[46] timechange_0.3.0     rmarkdown_2.28       bit_4.5.0           
[49] png_0.1-8            evaluate_1.0.0       viridisLite_0.4.2   
[52] rlang_1.1.4          Rcpp_1.0.13          glue_1.8.0          
[55] DBI_1.2.3            rstudioapi_0.16.0    vroom_1.6.5         
[58] jsonlite_1.8.9       R6_2.5.1             units_0.8-5         

Code in the R Markdown File

#-----setup-----

# set knitr options
knitr::opts_chunk$set(echo = TRUE)

# clear work space of all objects and unload all extra (non-base) packages
rm(list = ls(all = TRUE))
if (!is.null(sessionInfo()$otherPkgs)) {
    res <- suppressWarnings(
        lapply(paste('package:', names(sessionInfo()$otherPkgs), sep=""),
               detach, character.only=TRUE, unload=TRUE, force=TRUE))
    rm(res)
}

#-----load libraries-----

# Load pacman into memory, installing as needed
my_repo <- 'http://cran.r-project.org'
if (!require("pacman")) {install.packages("pacman", repos = my_repo)}

# Load the other packages, installing as needed.
pacman::p_load(knitr, dplyr, tidyr, readr, stringr, ggplot2, hms, 
               lubridate, purrr, tsibble, feasts, slider, #ggmap, 
               ggspatial, maptiles, sf, # mapping
               ggrepel)

#-----directory organization and read data-----

# specify data directory
data_dir <- file.path("Datasets", "mobile_monitoring")

# name destination file
dest <- file.path(data_dir, "Jul19MOV-UP.zip")
  
# Download files if not already present
if (!file.exists(dest)) {
    
  # create "Datasets" and "mobile_monitoring" directories if they do not already
  # exist
  dir.create(file.path("Datasets", "mobile_monitoring"), 
             showWarnings = FALSE, recursive = TRUE)

  # specify URL of zip file
  url <- "https://faculty.washington.edu/sheppard/envh556/Datasets/mobile_monitoring/Jul19MOV-UP.zip"
  
  # download zip
  download.file(url = url, destfile = dest)
  
  # unzip file
  unzip(zipfile = dest, exdir = data_dir)

  # # delete zip files (optional)
  # unlink(dest)
  # unlink(str_subset("__MACOSX", list.files(data_dir)), recursive = TRUE)
  
}

#-----datetime.and.date objects-----

# show POSIXct object -- a datetime object with both date and time
Sys.time()

# show class
# Note that 'POSIXt' is a virtual class which cannot be used directly. This
# virtual class exists from which both of the classes (POSIXct and POSIXlt)
# inherit. POSIXt is used to allow operations, such as subtraction, to mix the
# two classes.
class(Sys.time())

# show Date object-- date without time included
Sys.Date()

# show class
class(Sys.Date())

#-----gather file names for import-----

# get filename paths for data files
filenames <- list.files(data_dir) %>% str_subset("car1")

#-----gps-----

# Get GPS file name  
gps_file <- str_subset(filenames, "GPS")

# read GPS data
gps <- read_csv(file.path(data_dir, gps_file)) %>%
   
   # set column names
   set_names(c("date", "time", "latitude", "longitude", "altitude", "speed")) %>% 
   
   # create datetime column (notice we must specify the timezone)
   mutate(datetime = as.POSIXct(paste(date, time), tz = "US/Pacific"))

# show dataframe
gps

# and time zone
tz(gps$datetime)

#-----p-trak-----

# get p-trak not-screened file names
ptrak_noscreen_file <- filenames %>% 
   str_subset("PT") %>% 
   str_subset("scrnd|Scrnd|Screened|screen", negate = TRUE)

# read data
ptrak_noscreen <-  read_tsv(file.path(data_dir, ptrak_noscreen_file), skip = 29) %>% 
   
   # name columns
   set_names(c("date","time", "ptrak_noscreen_conc")) %>% 
   
   # modify date column and create datetime column (notice we must specify the
   # date format)
   mutate(date = as.Date(date, format = "%m/%d/%Y"), 
          datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") )


# get p-trak-screened file names and read in data
ptrak_screen_file <- filenames %>% 
   str_subset("scrnd|Scrnd|Screened|screen")

# read data
ptrak_screen <-  read_tsv(file.path(data_dir, ptrak_screen_file), skip = 29) %>% 
   
   # name columns
   set_names(c("date","time", "ptrak_screen_conc")) %>% 
   
   # modify date column and create datetime column
   mutate(date = as.Date(date, format = "%m/%d/%Y"), 
          datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") )


# show dataframes - have PNC measurements every second
ptrak_noscreen
ptrak_screen

#-----co2-----

# get CO2 file name
co2_file <- str_subset(filenames, "CO2")

# read data (tab separated values), specify time as character because of time parsing issue
co2 <- read_tsv(file.path(data_dir, co2_file), skip = 1, 
                col_types = cols(`System_Time_(h:m:s)` = col_character()) 
                ) %>% 
   
   # name columns
   setNames(c("date", "time", "co2_conc", "h2o_conc", "temp", "pressure", 
              "co2_absorp", "flow") ) %>% 
   
   # select columns of interest
   select(date, time, co2_conc) %>% 
   
   # address time parsing (some minute values have only one digit) and 
   # create datetime column
   mutate(time = as_hms(time), 
          datetime = as.POSIXct(paste(date, time), tz = "US/Pacific") )

# show dataframe - have CO2 measurements every second
head(co2)

str(co2$datetime)
tz(co2$datetime)

#-----merge datasets-----

# create a list of all data to merge 
instrument_list <- list(gps = gps, ptrak_noscreen = ptrak_noscreen, 
                        ptrak_screen = ptrak_screen, co2 = co2)

# use `purrr::reduce` to run `full_join()` on listed items 
all_data <- reduce(instrument_list, full_join) %>% 
   
   # reorder columns so datetime is first
   relocate(datetime) %>% 
   
   # arrange data in datetime order
   arrange(datetime)

# use glimpse to learn more about the data
glimpse(all_data)

#-----plot data-----

# note that these data are all for one day
plot_date <- first(all_data$date)

# transform data from wide to long for ggplot
all_data %>% pivot_longer(cols = contains("_conc"), 
                          names_to = "pollutant", 
                          values_to = "concentration") %>% 
# pipe into ggplot
ggplot(aes(x = datetime, y = concentration)) + 
  
  # plot each pollutant in a facet
  facet_wrap(~pollutant, ncol = 1, scales = "free_y") + 
  
  # specify type of plot
  geom_line() +
  
  # specify theme
  theme_bw() + 
  labs(title = plot_date)

#-----data availability plot-----

# transform data from wide to long for ggplot (using latitude for GPS)
all_data %>% 
  mutate(GPS = latitude) %>%
  pivot_longer(cols = c(contains("conc"), GPS), 
               names_to = "measurement") %>%

# make plot
ggplot(aes(x = datetime, y = measurement, color = factor(measurement)) ) +
  geom_point(size = 0.5, alpha=0.4) + 
  labs(x = "Time", y = "Instrument") + 
  theme_bw() + 
  theme(legend.position = "none")

#-----missing-----

lapply(all_data, function(i){ 
   
   tibble( 
          # sum missing
          n_miss = sum(is.na(i)), 
          
          # percent missing
          perc_miss = round(n_miss/length(i) * 100, 1)
          )
   }) %>% 
   
   # bind list
   bind_rows(.id = "variable")

#-----time stamps-----

# we'll use the instrument data list
lapply(instrument_list, function(i){
   
   # get vector of datetimes
   datetimes <- i[["datetime"]]
   
   # get a vector of the differences between consecutive measurements
   time_diff <- datetimes - lag(datetimes)
   
   # summarize differences
   table(time_diff)
   
}) %>% 
   
   # use list names
   set_names(names(instrument_list))


# Most measurements occur every second. Note the differences in the GPS data: a
# couple of measurements appear to be duplicates, and a handful have larger gaps.

#-----drop missing gps-----

all_data <- drop_na(all_data, latitude, longitude)

#-----try tsibble-----

# # use "try" here so document knits
# try( as_tsibble(all_data, index = datetime) )

# inspect duplicate rows
duplicates(all_data, index = datetime)

#-----to tsibble-----

# remove duplicate rows, and convert to `tsibble`
ts_data <- all_data %>% 
  distinct(datetime, .keep_all = TRUE) %>% 
  as_tsibble(index = datetime)

head(ts_data)

#-----fill ts-----

# count rows before filling
paste("number of rows before filling gaps:", nrow(ts_data))

# save for calculation below 
temp <- ts_data

# Turn implicit missing values into explicit missing values (add NAs to data)
ts_data <- fill_gaps(ts_data) 
  
# # if you wanted to "fill" the new explicitly missing data with a value drawn
# # from the dataset, here is an option. (add a `%>%` to the function above). 
# # Note this could have downstream effects. 
# fill(co2_conc, .direction = "down")

# count rows after filling
paste("number of rows after filling gaps:", nrow(ts_data))


# see what rows were added - these contain NA values other than dates
differences_df <- anti_join(ts_data, temp)
head(differences_df)

paste("number of rows added:", nrow(differences_df))

#-----autocorrelation-----

# autocorrelation plot - shows the overall correlation structure across lags
## calculates correlation between a conc and its lagged versions (both immediate/direct and indirect)

## ACF measures the linear (Pearson correlation coef, R) correlation between a time series and its lagged versions in a tsibble.  

## plot: A high positive correlation suggests that the values are similar between the time points separated by the lag. Values near zero indicate no correlation at that lag. 
## The blue dashed lines mark the approximate 95% bounds expected if no autocorrelation exists in the data.
ts_data %>% ACF(co2_conc,  # column used for computation
                lag_max = 60 # maximum lag at which to calculate the acf
                ) %>% autoplot()



# partial autocorrelation plot
## shows only the correlation between a conc (e.g. at time t) and a lag conc  (e.g. at time t-2), adjusting for the influence of intermediate conc lags(e.g. at time t-1).

## The first lag (lag 1) has a high and statistically significant partial autocorrelation (above the blue dashed line), indicating that the immediate prior value of the time series has a strong direct influence on the current value.
## After lag 1 (and possibly lag 2 or 4), the partial autocorrelations become less significant (fall within the blue dashed lines). Beyond the first lag, there is thus minimal direct influence of earlier values on the current value once the effects of lag 1 (and other significant lags) are accounted for.

ts_data %>% PACF(co2_conc, lag_max = 60) %>% autoplot()

#-----slider-----

ts_data <- ts_data %>% 
  
  # left join new 1-minute period values
  left_join(

    # Approach 1: aggregate data to 1 min averages using fixed period intervals
    ## returns a dataframe with repeated 1 min avg measures every second
    slide_period_dfr(ts_data, ts_data$datetime, "minute",
                     ~tibble(datetime = floor_date(.x$datetime),
                             co2_conc_one = mean(.x$co2_conc)
                             ) ),
    by = "datetime" ) %>% 
  
  # add a datetime column with 1-minute precision to help if you choose to
  # `group_by()` and `summarise()` 1-minute averages
  mutate(datetime_one = ymd_hm(format(datetime, "%Y-%m-%d %H:%M"), 
                               tz = "US/Pacific")) %>%
  
  # Approach 2: calculate moving averages, e.g., a 2-minute window, to provide a
  # smoothed version of the time series (this works because we've made sure our
  # datetime index is unique and complete)
  mutate(co2_conc_two = slide_dbl(co2_conc, ~mean(.x, na.rm = TRUE), 
                                  .before = 60, .after = 60) ) 

# # slide_period also works in the same way, but it returns a vector of different
# # length than the original inputs
# with(ts_data, 
#      slide_period_dbl(.x = co2_conc, .i = datetime, 
#              .period = "minute", .f = mean, na.rm = TRUE) 
#      )

# glimpse
glimpse(ts_data)

# compare a few values (convert to a dataframe so printing doesn't round/limit the displayed digits)
head(select(ts_data, contains(c("datetime", "co2"))) %>% as.data.frame())

#-----new timescales-----

ts_new <- ts_data %>% 
  
  # get the "floor" of each datetime row (unfortunately `tsibble` doesn't let us
  # use "datetime" for this new variable name)
  index_by(datetime_new = floor_date(datetime, unit = "5mins")) %>%
  
  # summarise the mean of rows across all dataframe columns
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)), .groups = "drop") %>% 
  
  # rename to get "datetime" variable name back
  rename(datetime = datetime_new) %>% 
  
  # create variable so new 5 min averages are aligned (i.e., centered) with originals for plots (because of `floor_date()` behavior)
  mutate(plot_time = datetime + minutes(2) + seconds(30))
  
# glimpse
glimpse(ts_new)

#---- 5-min autocorrelation ----

# check 5-minute average autocorrelation
ts_new %>% ACF(co2_conc) %>% autoplot()

#-----plot different time averages-----

# we're going to use a simpler approach here and just manually specify each
# series rather than make a dataframe with all the plotting data combined

ggplot() + 
  
  # plot original 1-second data
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc, color = "blue")) + 
  
  # plot `slider` 1-min period averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_one, color = "black")) +
  
  # plot `slider` 2-min moving averages
  geom_line(data = ts_data, aes(x = datetime, y = co2_conc_two, color = "red")) +
  
  # plot `tsibble` 5-minute block averages (note `x = plot_time`)
  geom_line(data = ts_new, aes(x = plot_time, y = co2_conc, color = "darkgreen")) +  
  
  # specify legend values manually
  scale_color_identity(name = "Time Average", 
                       breaks = c("blue", "black", "red", "darkgreen"), 
                       labels = c("1-sec", "1-min", "2-min", "5-min"), 
                       guide = "legend") +
  
  # labels
  labs(x = "Time", 
       y = "CO2 (ppm)" 
       ) +
  
  # theme
  theme_bw() + 
  theme(legend.position = "bottom")

#-----get map objects-----
# WGS84 latitude-longitude. same as: "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
latlong_proj <- 4326  

# projection in meters we need for distance calculations
lambert_proj <- "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs" 


# define the bounding box for the map
bbox <- ts_data %>% 
  
  # first filter to time period (and therefore geographic area) of interest
  filter(between(datetime, 
                 as.POSIXct("2018-07-19 13:13:00", tz = "US/Pacific"),
                 as.POSIXct("2018-07-19 16:32:00", tz = "US/Pacific")) ) %>% 
  drop_na(latitude, longitude) %>%
  st_as_sf(coords = c('longitude', 'latitude'), crs = latlong_proj) %>%
  # add a buffer (in meters)
  st_transform(crs = lambert_proj) %>%
  st_buffer(1e4) %>%
  st_transform(crs = latlong_proj) %>%
  # then pipe into `st_bbox()` 
  st_bbox()
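Stripped of the `sf` buffering and reprojection, the core of the bounding-box step is just the coordinate extremes. A minimal base-R sketch with made-up coordinates:

```r
# minimal bounding box from longitude/latitude vectors (base R sketch;
# the sf pipeline above additionally buffers by 10 km in a metric projection,
# which is why the buffer is applied in the Lambert projection, not in degrees)
lon <- c(-122.33, -122.30, -122.35)
lat <- c(47.60, 47.65, 47.61)
bbox_sketch <- c(xmin = min(lon), ymin = min(lat), xmax = max(lon), ymax = max(lat))
```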

# Note: you could write code to save these map tiles for offline use, but that
# can run into technical issues, so we won't do that here.

tiles <- maptiles::get_tiles(x = bbox, provider = "OpenStreetMap", zoom = 12)

basemap <- ggplot() +
    # Add basemap tiles as background
    ggspatial::layer_spatial(tiles) + 
    theme_minimal()

basemap

#-----get time stamps-----

map_times <- ts_data %>% 
  
  # convert to `tibble` because we're breaking up our `tsibble`
  as_tibble() %>%
  
  # filter to rows with :00, :15, :30, and :45 minute times
  filter(str_detect(time, pattern = ":00:00|:15:00|:30:00|:45:00")) %>% 
  
  # make a map_time variable, dropping the seconds 
  # (`$` indicates the character pattern should be at the end of the string)
  mutate(map_time = str_remove(time, pattern = ":00$"))
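The `stringr` calls above have direct base-R equivalents (`grepl()` and `sub()`), which may help clarify what the patterns do. A small sketch with hypothetical time strings:

```r
# base-R equivalents of str_detect() and str_remove() used above (illustrative)
times <- c("13:00:00", "13:07:21", "13:15:00", "13:30:00", "13:45:00")

# keep times falling on :00, :15, :30, and :45 past the hour
keep <- grepl(":00:00|:15:00|:30:00|:45:00", times)

# drop the trailing seconds; `$` anchors the pattern to the end of the string
labels <- sub(":00$", "", times[keep])
```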

#-----map concentrations-----

# map concentrations
map_co2 <- basemap +
  
  # map locations with concentrations
  geom_point(data = drop_na(ts_data, co2_conc), 
             aes(x = longitude, y = latitude, color = co2_conc)) + 
  
  # add measurement times to map
  geom_text_repel(data = map_times,
                  aes(x = longitude, y = latitude, label = map_time), 
                  force_pull = -0.02 ) +
  
  # color scale
  scale_color_continuous(type = "viridis") +  
  
  # labels
  labs(color = "CO2 (ppm)") +
  
  # Bounding box limits for latitude and longitude
  coord_sf(xlim = c(bbox$xmin, bbox$xmax),
           ylim = c(bbox$ymin, bbox$ymax)) +
  
  # theme
  theme_void() 

# show map
map_co2

#-----calculate background with moving functions-----

# specify the time in minutes for moving calculations 
# (for percentile `_pct` and minimum `_min`)
time_pct <- 5
time_min <- 30

# specify the percentile considered to be background
pct <- 0.05

# use `slider` to add moving background calculations to `ts_data`, then
# calculate an adjusted concentration by subtracting the background.
# For moving percentiles, assign new names with suffix `_pct`; for moving
# minimums use `_min`. Note `min()` returns Inf when all values in a window
# are NA; use `na_if()` to convert those to NA.
ts_data <- ts_data %>% 
  mutate( 
         # using moving percentiles
         co2_background_pct = slide_dbl(co2_conc, 
                                    ~quantile(.x, probs = pct, na.rm = TRUE), 
                                    .before = (time_pct*60)/2, 
                                    .after = (time_pct*60)/2), 
         co2_adj_pct = co2_conc - co2_background_pct, 
         
         # using moving minimums
         co2_background_min = slide_dbl(co2_conc, 
                                 ~min(.x, na.rm = TRUE), 
                                 .before = (time_min*60)/2, 
                                 .after = (time_min*60)/2), 
         co2_background_min = na_if(co2_background_min, Inf), 
         co2_adj_min = co2_conc - co2_background_min
         )
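The centered rolling windows that `slide_dbl()` computes above can be sketched in base R with a short loop. This is a minimal illustration of the technique (the `roll_stat` helper is hypothetical, and the window here is tiny for readability):

```r
# centered rolling statistic in base R, a sketch of what slide_dbl() does above:
# for each index, apply f() to the window of +/- half_width points, truncated at the edges
roll_stat <- function(x, half_width, f, ...) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    window <- x[max(1, i - half_width):min(n, i + half_width)]
    f(window, ...)
  }, numeric(1))
}

# rolling minimum with a half-width of 1 (3-point window)
x <- c(5, 3, 8, 1, 9, 2)
roll_stat(x, 1, min, na.rm = TRUE)
```

In the main code the half-widths are `(time_pct*60)/2` and `(time_min*60)/2` observations, so a 5-minute percentile window or 30-minute minimum window is centered on each 1-second observation.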

#-----plot background-----

# plot the two background estimates on the same plot
ggplot(data = ts_data) + 
  geom_line(aes(x = datetime, y = co2_background_pct, color = "black")) + 
  geom_line(aes(x = datetime, y = co2_background_min, color = "red")) + 
  labs(x = "Time", y = "CO2 Background (ppm)") +
  
  # specify legend values manually
  scale_color_identity(name = "Background Type",
                       breaks = c("black", "red"),
                       labels = c("Percentile", "Minimum"),
                       guide = "legend") +
  # theme
  theme_bw() + 
  theme(legend.position = "bottom")
  
#-----loess smooth of moving background-----

# specify loess models
mdl_pct <- loess(co2_background_pct ~ seq_along(time), data = ts_data, 
                 span = 0.1, degree = 1)

mdl_min <- loess(co2_background_min ~ seq_along(time), data = ts_data, 
                 span = 0.1, degree = 1)

# add smoothed predictions of background to `ts_data`
ts_data <- ts_data %>% 
  mutate(co2_background_pctl = predict(mdl_pct, ts_data),
         co2_background_minl = predict(mdl_min, ts_data)) 
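`loess()` comes from base R's `stats` package, so its behavior is easy to demonstrate on simulated data. The sketch below fits a locally weighted regression to a noisy sine wave with the same `degree = 1` used above (the `span` is chosen for this toy series, not taken from the lab):

```r
# loess smoothing on a simulated noisy series (stats::loess, as used above)
set.seed(1)
i <- 1:200
truth <- sin(i / 20)
y <- truth + rnorm(200, sd = 0.2)

# span controls the fraction of points in each local fit; degree = 1 fits local lines
mdl <- loess(y ~ i, span = 0.3, degree = 1)
smooth <- predict(mdl, data.frame(i = i))
```

The smoothed curve should track the underlying signal much more closely than the raw noisy series does, which is the same reason loess is applied to the (jagged) rolling background estimates here.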

# prepare dataframe for ggplot 
# (creating new columns for our different types of estimates)
ts_data %>% pivot_longer(contains("_background_")) %>% 
  mutate(min_pct = if_else(str_detect(name, "min"), "minimum", "percentile"), 
         # smoothed variables end in "l" (e.g., `co2_background_minl`)
         smooth_un = if_else(str_detect(name, "l$"), "smoothed", "unsmoothed") ) %>% 

# compare all background estimates
ggplot(aes(x = datetime, y = value, 
           linetype = interaction(smooth_un, min_pct, sep = " "), 
           color = interaction(smooth_un, min_pct, sep = " ") ) ) + 
  
  # plot lines
  geom_line() + 
  
  # set colors
  scale_color_manual(name = "Background Type", 
                     values = c("red", "red", "black", "black")) +
  
  # set linetype
  scale_linetype_manual(name = "Background Type", 
                        values = c("dashed", "solid", "dashed", "solid")) + 
  
  # labels
  labs(x = "Time", y = "CO2 Background (ppm)") +
  
  # theme
  theme_bw() + 
  theme(legend.position = "bottom")

#-----plot co2 and background-----

ggplot(data = ts_data) + 
  
  # plot the co2 observations
  geom_point(aes(x = datetime, y = co2_conc), size = 0.8, color = "darkgray") +
  
  # plot co2 background using the loess smoothed and unsmoothed rolling minimums
  geom_line(aes(x = datetime, y = co2_background_minl), lty = "dashed", color = "red") + 
  geom_line(aes(x = datetime, y = co2_background_min), lty = "solid", color = "red") + 
  
  # labels
  labs(x = "Time", y = "CO2 with Background (ppm)") +
  
  # theme
  theme_bw()

#-----id peaks-----

# specify the time in minutes for the rolling window
time_window <- 1

# specify winsorize cut-point (percentile)
w_pct <- 0.95

# specify cutoff for peak change in slope values
pqv <- 0.6

# we can use `slider` to calculate these rolling statistics, starting from
# the CO2 adjusted by the rolling minimum background (co2_background_min)
ts_data <- ts_data %>% 
  mutate(
         # calculate adjusted CO2 concentration
         co2_adj = (co2_conc - co2_background_min),
         
         # identify the "next" value in the series with `lead()`
         co2_lead = lead(co2_adj), 
         
         # calculate squared difference in consecutive CO2 concentrations
         co2_diff_sq = (co2_adj - co2_lead)^2,
         
         # limit extreme values by winsorizing
         co2_diff_sq = replace(co2_diff_sq, co2_diff_sq > 
                                 quantile(co2_diff_sq, w_pct, na.rm = TRUE), 
                               quantile(co2_diff_sq, w_pct, na.rm = TRUE)),
         
         # calculate a 1 minute rolling mean of CO2 differences
         roll_co2_diff = slide_dbl(co2_diff_sq, 
                                   ~mean(.x, na.rm = TRUE), 
                                         .before = (time_window*60)/2, 
                                         .after = (time_window*60)/2),
         # calculate peaks
         co2_is_peak = if_else(roll_co2_diff >= 
                                 quantile(roll_co2_diff, pqv, na.rm = TRUE), 
                               TRUE, FALSE)
         ) 
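The winsorizing step above caps extreme values at a chosen percentile rather than discarding them. In base R this is just `quantile()` plus `pmin()`; a minimal sketch on made-up values:

```r
# winsorizing in base R: cap values above a chosen percentile (illustrative sketch;
# the cut-point here is the 80th percentile purely for this tiny example)
x <- c(1, 2, 3, 4, 100)
cap <- unname(quantile(x, 0.8, na.rm = TRUE))
x_wins <- pmin(x, cap)
```

Capping (rather than dropping) the largest squared differences keeps the rolling mean of `co2_diff_sq` from being dominated by a handful of extreme 1-second jumps.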

#-----plot co2 peaks-----

ggplot(ts_data) + 
  
  # plot co2 observations
  geom_point(aes(x = datetime, y = co2_conc, color = co2_is_peak), size = 0.8) +
  
  # plot background
  geom_line(aes(x = datetime, y = co2_background_pct)) + 
  
  # set colors
  scale_colour_manual(values = c("darkgray", "darkred")) +
  
  # labels
  labs(x = "Time", y = "CO2 Peaks with Background (ppm)") +
  
  # theme
  theme_bw() + 
  theme(legend.position = "none")

#-----map emission ratios-----

# calculate P-Trak ratio
ts_data <- ts_data %>% 
  mutate(ptrak_ratio = (ptrak_noscreen_conc - ptrak_screen_conc) / ptrak_noscreen_conc)

# map the ratios (color scale limited to 0-1)
map_ratios <- basemap +
  
  # locations with concentrations
  geom_point(data = ts_data %>% drop_na(ptrak_ratio), 
             aes(x = longitude, y = latitude, color = ptrak_ratio)) + 
  
  # color scale
  scale_color_continuous(type = "viridis", limits = c(0, 1)) + 
  
  # labels
  labs(color = "P-Trak Ratio \nscreened:unscreened") +
  
  # theme
  theme_void() 

# show map
map_ratios

#-----session info: beginning of Code Appendix-----

sessionInfo()

#-----appendix code-----

#-----functions used in this Rmd-----

# Show the names of all functions used (loaded in the current environment)
lsf.str()

# Show the definitions of all functions loaded into the current environment  
lsf.str() %>% set_names() %>% map(get, .GlobalEnv)
